[SPARK-8124] [SPARKR] [WIP] Created more examples on SparkR DataFrames #6668
Conversation
Here are more examples on SparkR DataFrames including creating a SQL context, loading data and simple data manipulation
@shivaram Here is the new submission. I would like to submit a few more examples on statistical modeling and machine learning on SparkR DataFrames.
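For reference, the workflow the examples cover can be sketched with the 1.4-era SparkR API discussed in this thread. This is a minimal sketch, not code from the PR: the app name is a placeholder, and it assumes SparkR is on the library path (e.g. via `$SPARK_HOME/R/lib`) with a local Spark installation available.

```r
# Minimal 1.4-era SparkR setup (sketch; appName is a placeholder)
library(SparkR)

# Create a Spark context, then a SQL context from it
sc <- sparkR.init(appName = "SparkR-DataFrame-example")
sqlContext <- sparkRSQL.init(sc)

# ... create and manipulate DataFrames here ...

sparkR.stop()
```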
We need to have the Apache License at the top of every file. You can see https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R#L1 for an example.
Also, per our style guide, we don't put author names / dates in the file itself, as these are tracked in the commit log.
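For context, the standard ASF license header used at the top of Spark's R example files (as in the dataframe.R file linked above) is a comment block of this form:

```r
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
```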
@shivaram I have added the Apache license at the top of every file and removed the author name & date.
This comment should probably be 'Load SparkR library into your R session'
Now using sqlContext as the variable name
This should be describe and not Describe?
Provided two options for creating DataFrames. Option 1: from local data frames; Option 2: directly create DataFrames using the read.df function
Deleted the source() function and combined all the code into one file
Deleted the getting started file and combined all the code into one file
Renamed file to data-manipulation.R
@shivaram I wanted to provide two options for creating DataFrames: one where R users can convert their local data frames into DataFrames, and the second using read.df().
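The two options being discussed can be sketched as follows under the 1.4-era SparkR API. This is illustrative, not the PR's code: it assumes an existing `sqlContext`, the sample data frame is made up, and the CSV path and `header` option are placeholders (Option 2 also needs the spark-csv package on the classpath).

```r
# Option 1: convert a local R data frame into a SparkR DataFrame
localDF <- data.frame(name = c("John", "Smith"), age = c(19, 23))
df1 <- createDataFrame(sqlContext, localDF)

# Option 2: read a file directly into a DataFrame with read.df
# (path and options are illustrative)
df2 <- read.df(sqlContext, "flights.csv",
               source = "com.databricks.spark.csv", header = "true")

printSchema(df1)
```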
Would read.csv, which is part of base R, also work for this? I know that data.table is more efficient, but I would like to avoid installing new packages in the example.
Replaced the data.table function (fread) with base R function for reading csv files (read.csv)
@shivaram Yes, the base R function works. I have changed it.
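For illustration, base R's read.csv handles this with no extra package install. The inline CSV data here is made up for the sketch; the example reads from a string via read.table's `text` argument rather than a file.

```r
# Base R's read.csv replaces data.table::fread -- no extra install needed.
csvText <- "year,month,day,carrier
2013,1,1,UA
2013,1,2,AA"
flights <- read.csv(text = csvText, stringsAsFactors = FALSE)

nrow(flights)      # 2
flights$carrier    # "UA" "AA"
```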
Could we take this in as a command line argument? I think something like
args <- commandArgs(trailing = TRUE)
if (length(args) != 1) {
  print("Usage: data-manipulation.R <path-to-flights.csv>")
  print("The data can be downloaded from: https://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv")
  q("no")
}
flightsCsvPath <- args[[1]]
should do the trick
Taking in data set as a command line argument
@shivaram I fixed that. You will notice that read.csv() does not work well with SSL (that is, https connections), so I changed the connection to http.
This should be sparkRSQL and not SparkRSQL
So I tried to run this locally and this step is very slow for the dataset we are using here (I filed https://issues.apache.org/jira/browse/SPARK-8277) due to the way we convert local data frames to lists.
I see two options here: (1) use fewer rows in the example file, so that this runs fast, or (2) use a different dataset to demonstrate creating a SparkR DataFrame from a local data frame (the CSV reader is fine).
Let me know which you think is better.
To create a SparkR DataFrame, I used fewer rows of the local data frame.
@shivaram To create a Spark DataFrame from a local data frame, I used a subset of the data with fewer rows.
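The workaround might look like this. It is a sketch only: it assumes `flights` was read with read.csv and that a `sqlContext` exists, and the row count of 1000 is an arbitrary choice, not a value from the PR.

```r
# createDataFrame was slow on the full local data frame (SPARK-8277),
# so convert only a small subset; 1000 rows is an arbitrary choice.
flightsSmall <- head(flights, 1000)
flightsDF <- createDataFrame(sqlContext, flightsSmall)
```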
This line should also go inside the if block
Jenkins, ok to test
The source here needs to be com.databricks.spark.csv
BTW @rxin is there some way we can map source = csv to that automatically ?
not if csv is outside this ... maybe we can provide a way for data sources to register short names.
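To illustrate the point being discussed: in the 1.4-era API the external CSV reader has to be named by its fully qualified class, while later Spark versions did add a way for data sources to register short names, so a plain "csv" resolves automatically. A sketch (assumes `sqlContext` and `flightsCsvPath` from the snippet above exist, and spark-csv is on the classpath):

```r
# Spark 1.4: the external CSV reader must be named by its full class
df <- read.df(sqlContext, flightsCsvPath,
              source = "com.databricks.spark.csv", header = "true")

# Later Spark versions let data sources register short names,
# so this also works there:
# df <- read.df(sqlContext, flightsCsvPath, source = "csv", header = "true")
```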
Thanks @Emaasit for the update. I just had a few more things that I ran into while executing the example. Also you can verify some of these things by just running the example on your machine -- I just used a command of the form to check things
Test build #34619 has finished for PR 6668 at commit
@shivaram Ok. Got you.
LGTM. Thanks @Emaasit for this PR. There are some outstanding comments, but I'll fix them during the merge.
Thanks @shivaram.
Here are more examples on SparkR DataFrames including creating a Spark context and a SQL context, loading data and simple data manipulation.